
    Faster Robust Tensor Power Method for Arbitrary Order

    Tensor decomposition is a fundamental method used in various areas to deal with high-dimensional data. The \emph{tensor power method} (TPM) is one of the most widely used techniques for decomposing tensors. This paper presents a novel tensor power method for decomposing arbitrary-order tensors, which overcomes limitations of existing approaches that are often restricted to lower-order (less than 3) tensors or require strong assumptions about the underlying data structure. We apply a sketching method and achieve a running time of $\widetilde{O}(n^{p-1})$ for a tensor of order $p$ and dimension $n$. We provide a detailed analysis for any $p$-th order tensor, which has not been given in previous works.
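
    To make the classical power update behind TPM concrete, here is a minimal numpy sketch of one-component power iteration on a symmetric third-order tensor. It omits deflation and the sketching acceleration that yields the $\widetilde{O}(n^{p-1})$ running time; the iteration count and initialization are illustrative assumptions, not the paper's algorithm.

        import numpy as np

        def tensor_power_method(T, num_iters=100, seed=0):
            """Plain (non-sketched) rank-1 power iteration for a symmetric
            third-order tensor T: contract T with the current vector along
            two modes and renormalize."""
            n = T.shape[0]
            rng = np.random.default_rng(seed)
            u = rng.normal(size=n)
            u /= np.linalg.norm(u)
            for _ in range(num_iters):
                v = np.einsum('ijk,j,k->i', T, u, u)    # contraction T(I, u, u)
                u = v / np.linalg.norm(v)
            lam = np.einsum('ijk,i,j,k->', T, u, u, u)  # estimated eigenvalue
            return lam, u

        # toy usage: the single component of a rank-1 symmetric tensor is recovered
        n = 8
        a = np.random.randn(n); a /= np.linalg.norm(a)
        T = 2.0 * np.einsum('i,j,k->ijk', a, a, a)
        lam, u = tensor_power_method(T)
        print(round(float(lam), 3), round(float(abs(a @ u)), 3))  # ~2.0 and ~1.0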

    Attention Scheme Inspired Softmax Regression

    Large language models (LLMs) have brought transformative changes to human society. One of the key computations in LLMs is the softmax unit. This operation is important in LLMs because it allows the model to generate a distribution over possible next words or phrases, given a sequence of input words. This distribution is then used to select the most likely next word or phrase, based on the probabilities assigned by the model. The softmax unit plays a crucial role in training LLMs, as it allows the model to learn from the data by adjusting the weights and biases of the neural network. In the area of convex optimization, such as using the central path method to solve linear programming, the softmax function has been used as a crucial tool for controlling the progress and stability of the potential function [Cohen, Lee and Song STOC 2019; Brand SODA 2020]. In this work, inspired by the softmax unit, we define a softmax regression problem. Formally speaking, given a matrix $A \in \mathbb{R}^{n \times d}$ and a vector $b \in \mathbb{R}^n$, the goal is to use a greedy-type algorithm to solve \begin{align*} \min_{x} \| \langle \exp(Ax), {\bf 1}_n \rangle^{-1} \exp(Ax) - b \|_2^2. \end{align*} In a certain sense, our provable convergence result provides theoretical support for why a greedy algorithm can be used to train the softmax function in practice.
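
    As a rough numerical illustration of the objective above, the numpy sketch below minimizes $\| \langle \exp(Ax), {\bf 1}_n \rangle^{-1} \exp(Ax) - b \|_2^2$ with plain gradient descent; the paper analyzes a greedy-type method, and the step size and iteration count here are illustrative assumptions.

        import numpy as np

        def softmax(z):
            z = z - z.max()                    # numerical stability
            u = np.exp(z)
            return u / u.sum()

        def softmax_regression_gd(A, b, step=0.5, iters=2000):
            """Gradient descent on f(x) = || <exp(Ax), 1_n>^{-1} exp(Ax) - b ||_2^2."""
            x = np.zeros(A.shape[1])
            for _ in range(iters):
                p = softmax(A @ x)
                c = p - b
                # Jacobian of softmax(z) is diag(p) - p p^T; chain rule through z = Ax
                grad = 2.0 * A.T @ (p * c - p * (p @ c))
                x -= step * grad
            return x

        # toy usage: the target b is itself a softmax, so near-zero loss is attainable
        n, d = 20, 5
        A = np.random.randn(n, d)
        b = softmax(A @ np.random.randn(d))
        x_hat = softmax_regression_gd(A, b)
        print(np.linalg.norm(softmax(A @ x_hat) - b))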

    Convergence of Two-Layer Regression with Nonlinear Units

    Large language models (LLMs), such as ChatGPT and GPT-4, have shown outstanding performance in many tasks of human life. Attention computation plays an important role in training LLMs. The softmax unit and the ReLU unit are the key structures in attention computation. Inspired by them, we put forward a softmax ReLU regression problem. Generally speaking, our goal is to find an optimal solution to the regression problem involving the ReLU unit. In this work, we calculate a closed-form representation for the Hessian of the loss function. Under certain assumptions, we prove the Lipschitz continuity and the positive semidefiniteness of the Hessian. Then, we introduce a greedy algorithm based on an approximate Newton method, which converges in the sense of the distance to the optimal solution. Last, we relax the Lipschitz condition and prove convergence in the sense of the loss value.
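
    The abstract does not spell out the loss, so the sketch below uses an assumed stand-in, $\|\mathrm{softmax}(\mathrm{ReLU}(Ax)) - b\|_2^2$, and runs a Newton-style iteration with a finite-difference Hessian clipped to be PSD, only to mirror the approximate-Newton idea; the composition order, clipping threshold, and iteration count are assumptions, not the paper's algorithm.

        import numpy as np

        def softmax(z):
            u = np.exp(z - z.max())
            return u / u.sum()

        def loss(x, A, b):
            # assumed stand-in loss combining the ReLU and softmax units
            return np.sum((softmax(np.maximum(A @ x, 0.0)) - b) ** 2)

        def grad_hess(f, x, eps=1e-5):
            """Finite-difference gradient and Hessian (adequate for a small-d sketch)."""
            d = x.size
            g, H = np.zeros(d), np.zeros((d, d))
            E = eps * np.eye(d)
            for i in range(d):
                g[i] = (f(x + E[i]) - f(x - E[i])) / (2 * eps)
                for j in range(d):
                    H[i, j] = (f(x + E[i] + E[j]) - f(x + E[i] - E[j])
                               - f(x - E[i] + E[j]) + f(x - E[i] - E[j])) / (4 * eps ** 2)
            return g, H

        def approx_newton(A, b, iters=25):
            """Newton-style steps with the Hessian clipped to be PSD,
            echoing the PSDness the paper proves for its exact Hessian."""
            x = np.zeros(A.shape[1])
            f = lambda v: loss(v, A, b)
            for _ in range(iters):
                g, H = grad_hess(f, x)
                w, Q = np.linalg.eigh((H + H.T) / 2)
                w = np.maximum(w, 1e-3)        # PSD-ify so the step is a descent direction
                x -= Q @ ((Q.T @ g) / w)
            return x

        # toy usage with a realizable target
        n, d = 30, 4
        A = np.random.randn(n, d)
        b = softmax(np.maximum(A @ np.random.randn(d), 0.0))
        print(loss(approx_newton(A, b), A, b))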

    Randomized and Deterministic Attention Sparsification Algorithms for Over-parameterized Feature Dimension

    Large language models (LLMs) have shown their power in different areas. Attention computation, as an important subroutine of LLMs, has also attracted interest in theory. Recently, the static computation and dynamic maintenance of the attention matrix have been studied by [Alman and Song 2023] and [Brand, Song and Zhou 2023] from both the algorithmic perspective and the hardness perspective. In this work, we consider the sparsification of the attention problem. We make one simplification, which is that the logit matrix is symmetric. Let $n$ denote the length of the sentence and let $d$ denote the embedding dimension. Given a matrix $X \in \mathbb{R}^{n \times d}$, suppose $d \gg n$ and $\| X X^\top \|_{\infty} < r$ with $r \in (0,0.1)$; then we aim to find $Y \in \mathbb{R}^{n \times m}$ (where $m \ll d$) such that \begin{align*} \| D(Y)^{-1} \exp( Y Y^\top ) - D(X)^{-1} \exp( X X^\top) \|_{\infty} \leq O(r). \end{align*} We provide two results for this problem. $\bullet$ Our first result is a randomized algorithm. It runs in $\widetilde{O}(\mathrm{nnz}(X) + n^{\omega})$ time, has $1-\delta$ success probability, and chooses $m = O(n \log(n/\delta))$. Here $\mathrm{nnz}(X)$ denotes the number of non-zero entries in $X$. We use $\omega$ to denote the exponent of matrix multiplication; currently $\omega \approx 2.373$. $\bullet$ Our second result is a deterministic algorithm. It runs in $\widetilde{O}(\min\{\sum_{i\in[d]}\mathrm{nnz}(X_i)^2, dn^{\omega-1}\} + n^{\omega+1})$ time and chooses $m = O(n)$. Here $X_i$ denotes the $i$-th column of the matrix $X$. Our main findings have the following implication for applied LLM tasks: for any super-large feature dimension, we can reduce it down to a size nearly linear in the length of the sentence.
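
    A rough numpy illustration of the randomized route: replace the $d$ features of $X$ with $m = O(n \log(n/\delta))$ sketched features $Y = XS$ so that $YY^\top \approx XX^\top$ entrywise and the two normalized attention matrices stay close. The Gaussian sketch below is an illustrative stand-in, not the paper's construction, and the constants are assumptions.

        import numpy as np

        def attention(Z):
            """D(Z)^{-1} exp(Z Z^T), where D(Z) = diag(exp(Z Z^T) 1_n)."""
            E = np.exp(Z @ Z.T)
            return E / E.sum(axis=1, keepdims=True)

        def sparsify_features(X, delta=0.1, seed=0):
            """Feature reduction Y = X S with m = ceil(n log(n/delta)) columns.
            A Gaussian Johnson-Lindenstrauss sketch stands in for the paper's
            nnz(X) + n^omega time construction."""
            n, d = X.shape
            m = int(np.ceil(n * np.log(n / delta)))
            rng = np.random.default_rng(seed)
            S = rng.normal(size=(d, m)) / np.sqrt(m)    # E[S S^T] = I_d
            return X @ S

        # toy usage: d >> n and || X X^T ||_inf < r, as in the theorem statement
        n, d, r = 10, 5000, 0.1
        X = np.random.randn(n, d)
        X *= np.sqrt(0.5 * r / np.abs(X @ X.T).max())   # enforce the small-logit regime
        Y = sparsify_features(X)
        print(Y.shape[1], np.abs(attention(Y) - attention(X)).max())  # m and O(r)-ish error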

    Superiority of Softmax: Unveiling the Performance Edge Over Linear Attention

    Large transformer models have achieved state-of-the-art results in numerous natural language processing tasks. Among the pivotal components of the transformer architecture, the attention mechanism plays a crucial role in capturing token interactions within sequences through the utilization of the softmax function. Conversely, linear attention presents a more computationally efficient alternative by approximating the softmax operation with linear complexity. However, it exhibits substantial performance degradation when compared to the traditional softmax attention mechanism. In this paper, we bridge the gap in our theoretical understanding of the reasons behind the practical performance gap between softmax and linear attention. By conducting a comprehensive comparative analysis of these two attention mechanisms, we shed light on the underlying reasons why softmax attention outperforms linear attention in most scenarios.
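
    For concreteness, the two mechanisms under comparison can be written side by side in numpy. The feature map used for linear attention below (ReLU plus a small constant) is just one common illustrative choice, not something fixed by the paper.

        import numpy as np

        def softmax_attention(Q, K, V):
            """Standard attention: row-wise softmax of Q K^T / sqrt(d), applied to V."""
            d = Q.shape[1]
            S = Q @ K.T / np.sqrt(d)
            S = np.exp(S - S.max(axis=1, keepdims=True))
            return (S / S.sum(axis=1, keepdims=True)) @ V

        def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
            """Linear attention: replace the softmax with a feature map phi, so the
            summaries phi(K)^T V and phi(K)^T 1 are computed once -- O(n d^2)
            instead of O(n^2 d)."""
            Qf, Kf = phi(Q), phi(K)
            KV = Kf.T @ V                       # d x d_v summary
            Z = Kf.sum(axis=0)                  # d-dimensional normalizer
            return (Qf @ KV) / (Qf @ Z)[:, None]

        # toy usage: both produce an n x d output, but with different complexity
        n, d = 6, 4
        Q, K, V = (np.random.randn(n, d) for _ in range(3))
        print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)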

    Clustered Linear Contextual Bandits with Knapsacks

    In this work, we study clustered contextual bandits where rewards and resource consumption are the outcomes of cluster-specific linear models. The arms are divided into clusters, with the cluster memberships being unknown to the algorithm. Pulling an arm in a time period results in a reward and in consumption of each of multiple resources, and the total consumption of any resource exceeding its constraint implies the termination of the algorithm. Thus, maximizing the total reward requires learning not only models of the reward and the resource consumption, but also the cluster memberships. We provide an algorithm that achieves regret sublinear in the number of time periods, without requiring access to all of the arms. In particular, we show that it suffices to perform clustering only once, on a randomly selected subset of the arms. To achieve this result, we provide a sophisticated combination of techniques from the literature on econometrics and on bandits with constraints.
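
    A toy numpy sketch of the interaction protocol only: cluster-specific linear models generate rewards and resource consumption, and the process terminates once any budget is exhausted. The uniform arm-selection rule is a placeholder, not the paper's algorithm, and all numerical values are assumptions.

        import numpy as np

        rng = np.random.default_rng(0)

        # toy environment (sizes assumed): K arms in C clusters, cluster-specific
        # linear models for the reward and for the consumption of R resources
        K, C, d, R, T = 20, 3, 5, 2, 10_000
        budgets = np.full(R, 50.0)
        cluster_of = rng.integers(C, size=K)        # unknown to the learner
        theta_reward = rng.random((C, d))           # per-cluster reward model
        theta_cost = rng.random((C, R, d))          # per-cluster consumption models

        total_reward = 0.0
        for t in range(T):
            context = rng.random(d)
            arm = rng.integers(K)                   # placeholder policy; the paper's
                                                    # algorithm clusters a random arm
                                                    # subset once, then runs a
                                                    # constrained linear bandit
            c = cluster_of[arm]
            total_reward += theta_reward[c] @ context
            budgets -= theta_cost[c] @ context
            if (budgets <= 0).any():                # knapsack constraint: stop as soon
                break                               # as any resource is exhausted
        print(t, round(total_reward, 2))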

    Solving Tensor Low Cycle Rank Approximation

    Large language models have become ubiquitous in modern life, finding applications in various domains such as natural language processing, language translation, and speech recognition. Recently, a breakthrough work [Zhao, Panigrahi, Ge, and Arora Arxiv 2023] explains the attention model from the perspective of probabilistic context-free grammars (PCFG). One of the central computational tasks for computing probabilities in a PCFG can be formulated as a particular tensor low-rank approximation problem, which we call tensor cycle rank. Given an $n \times n \times n$ third-order tensor $A$, we say that $A$ has cycle rank $k$ if there exist three $n \times k^2$ matrices $U$, $V$, and $W$ such that \begin{align*} A_{a,b,c} = \sum_{i=1}^k \sum_{j=1}^k \sum_{l=1}^k U_{a,i+k(j-1)} \otimes V_{b, j + k(l-1)} \otimes W_{c, l + k(i-1) } \end{align*} for all $a \in [n], b \in [n], c \in [n]$. The classical tensor rank, Tucker rank, and train rank have been well studied in [Song, Woodruff, Zhong SODA 2019]. In this paper, we generalize the previous ``rotation and sketch'' technique on page 186 of [Song, Woodruff, Zhong SODA 2019] and show an input-sparsity-time algorithm for cycle rank.
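
    The defining identity above is easy to instantiate directly; the brute-force numpy sketch below assembles a tensor of cycle rank at most $k$ from random $U$, $V$, $W$ (a check of the definition, not the paper's input-sparsity-time algorithm).

        import numpy as np

        def cycle_rank_tensor(U, V, W, k):
            """Assemble A_{a,b,c} = sum_{i,j,l} U[a, i+k(j-1)] V[b, j+k(l-1)] W[c, l+k(i-1)]
            (1-indexed in the abstract; 0-indexed below)."""
            n = U.shape[0]
            A = np.zeros((n, n, n))
            for i in range(k):
                for j in range(k):
                    for l in range(k):
                        u = U[:, i + k * j]        # column i + k(j-1) in 1-indexed terms
                        v = V[:, j + k * l]
                        w = W[:, l + k * i]
                        A += np.einsum('a,b,c->abc', u, v, w)
            return A

        # toy usage: a random tensor of cycle rank (at most) k
        n, k = 6, 2
        U, V, W = (np.random.randn(n, k * k) for _ in range(3))
        A = cycle_rank_tensor(U, V, W, k)
        print(A.shape)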

    Unmasking Transformers: A Theoretical Approach to Data Recovery via Attention Weights

    In the realm of deep learning, transformers have emerged as a dominant architecture, particularly in natural language processing tasks. However, with their widespread adoption, concerns regarding the security and privacy of the data processed by these models have arisen. In this paper, we address a pivotal question: can the data fed into transformers be recovered using their attention weights and outputs? We introduce a theoretical framework to tackle this problem. Specifically, we present an algorithm that aims to recover the input data $X \in \mathbb{R}^{d \times n}$ from given attention weights $W = QK^\top \in \mathbb{R}^{d \times d}$ and output $B \in \mathbb{R}^{n \times n}$ by minimizing the loss function $L(X)$. This loss function captures the discrepancy between the expected output and the actual output of the transformer. Our findings have significant implications for the Localized Layer-wise Mechanism (LLM), suggesting potential vulnerabilities in the model's design from a security and privacy perspective. This work underscores the importance of understanding and safeguarding the internal workings of transformers to ensure the confidentiality of processed data.
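
    The precise $L(X)$ is defined in the paper; as a hedged stand-in consistent with the stated dimensions, the numpy sketch below takes $L(X) = \| D^{-1} \exp(X^\top W X) - B \|_F^2$ with $D$ the row-sum normalizer and minimizes it by gradient descent using a finite-difference gradient. The loss form, step size, and initialization are assumptions for illustration only.

        import numpy as np

        def attention_output(X, W):
            """Row-normalized exp(X^T W X); X is d x n, W is d x d, output is n x n."""
            E = np.exp(X.T @ W @ X)
            return E / E.sum(axis=1, keepdims=True)

        def recovery_loss(X, W, B):
            # assumed stand-in for the paper's L(X): squared discrepancy to the observed B
            return np.sum((attention_output(X, W) - B) ** 2)

        def recover_inputs(W, B, d, n, step=0.5, iters=300, eps=1e-5, seed=0):
            """Gradient descent on L(X) with a finite-difference gradient
            (a slow but dependency-free stand-in for autodiff)."""
            X = np.random.default_rng(seed).normal(size=(d, n)) * 0.1
            for _ in range(iters):
                G = np.zeros_like(X)
                for idx in np.ndindex(d, n):
                    E = np.zeros_like(X)
                    E[idx] = eps
                    G[idx] = (recovery_loss(X + E, W, B)
                              - recovery_loss(X - E, W, B)) / (2 * eps)
                X -= step * G
            return X

        # toy usage: build W, B from a hidden X_true, then try to reconstruct an input
        d, n = 3, 4
        X_true = np.random.randn(d, n) * 0.5
        W = np.random.randn(d, d)
        B = attention_output(X_true, W)
        X_hat = recover_inputs(W, B, d, n)
        print(recovery_loss(X_hat, W, B))       # small residual = a consistent reconstruction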